Partially decoded instruction cache

A microprocessor partially decodes instructions
retrieved from main memory before placing them
into the microprocessor's integrated instruction
cache. Each storage location in the instruction cache
includes two slots for decoded instructions. One slot
controls one of the microprocessor's integer pipelines
and a port to the microprocessor's data cache.
A second slot controls the second integer pipeline or
one of the microprocessor's floating point units. The
instructions retrieved from main memory are decoded
by a loader unit, which translates them from the
compact form in which they are stored in main
memory and places them into the two slots of the
instruction cache entry according to their functions.
In addition, auxiliary information is placed in the
cache entry along with the instructions to control
parallel execution as well as emulation of complex
instructions. A bit in each instruction cache entry
indicates whether the instructions in the two slots are
independent, so that they can be executed in parallel,
or dependent, so that they must be executed
sequentially. Using a single bit for this purpose allows
two dependent instructions to be stored in the
slots of a single cache entry.
The present invention relates to microprocessor
architectures and, in particular, to a microprocessor
that partially decodes instructions retrieved
from external memory before storing them in an
internal instruction cache. Partially decoded instructions
are retrieved from the internal cache for either
parallel or sequential execution by multiple,
parallel, pipelined functional units.

In recent years, there has been a trend in the
design of microprocessor architectures from Complex
Instruction Set Computers (CISC) toward Reduced
Instruction Set Computers (RISC) to achieve
high performance while maintaining simplicity of
design.

In a CISC architecture, each macroinstruction
received by the processor must be decoded internally
into a series of microinstruction subroutines.
These microinstruction subroutines are then
executed by the microprocessor.

In a RISC architecture, the number of macroinstructions
which the processor can understand and
execute is greatly reduced. Further, those
macroinstructions which the processor can understand and
execute are very basic, so that the processor either
does not decode them into any microinstructions
(the macroinstruction is executed in its macro form)
or the decoded microinstruction subroutine involves
very few microinstructions.

The transition from CISC architectures to RISC
architectures has been driven by two fundamental
developments in computer design that are now
being extensively applied to microprocessors.
These developments are integrated cache memory
and optimizing compilers.

A cache memory is a small, high speed buffer
located between the processor and main memory
to hold the instructions and data most recently
used by the processor. Experience shows that
computers very commonly exhibit strong characteristics
of locality in their memory references. That
is, references tend to occur frequently either to
locations that have recently been referred to
(temporal locality) or to locations that are near
others that have recently been referred to (spatial
locality). As a consequence of this locality, a cache
memory that is much smaller than main memory
can capture the large majority of a program's
memory references. Because the cache memory is
relatively small, it can be realized from a faster
memory technology than would be economical for
the much larger main memory.

Before the development of cache memory
techniques for use in mainframe computers, there
was a large imbalance between the cycle time of a
processor and that of memory. This imbalance was
a result of the processor being realized from
relatively high speed bipolar semiconductor technology
and the memory being realized from much slower
magnetic-core technology. The inherent speed
difference between logic and memory spurred the
development of complex instruction sets that would
permit the fetching of a single instruction from
memory to control the operation of the processor
for several clock cycles. The imbalance between
processor and memory speeds was also characteristic
of the early generations of 32-bit microprocessors.
Those microprocessors would commonly take
4 or 5 clock cycles for each memory access.

Without the introduction of integrated cache
memory, it is unlikely that RISC architectures
would have become competitive with CISC
architectures. Because a RISC processor executes
more instructions than does a CISC processor to
accomplish the same task, a RISC processor can
deliver performance equivalent to that of a CISC
only if a faster and more expensive memory system
is employed. Integrated cache memory enables
a RISC processor to fetch an instruction in
the same time required to execute the instruction
by an efficient processor pipeline.

The second development that has led to the
effectiveness of RISC architectures is optimizing
compilers. A compiler, which may be implemented
in either hardware or software, translates a computer
program from the high-level language used
by the programmer into the machine language
understood by the computer.

For many years after the introduction of high-level
languages, computers were still extensively
programmed in assembly language. Assembly language
is a low-level source code language employing
crude mnemonics that are more easily remembered
by the programmer than object-code or binary
equivalents. The advantages of improved software
productivity and translatability of high-level
language programming were clear, but simple
compilers produced inefficient code. Early generations
of 32-bit microprocessors were developed
with consideration for assembly language programming
and simple compilers.

More recently, advances in compiler technology
are being applied to microprocessors. Optimizing
compilers can analyze a program to allocate
large numbers of registers efficiently and to manage
processor pipeline resources. As a consequence,
high-level language programs can execute
with performance comparable to or exceeding that
of assembly programs.

Many of the leading pioneers in RISC developments
have been compiler specialists who have
demonstrated that optimizing compilers can produce
highly efficient code for simple, regular
architectures.

Highly integrated single-chip microprocessors
employ both pipelined and parallel execution to
improve performance. Pipelined execution means






EP 0 459 232 A2 



[...] them into the microprocessor's integrated
instruction cache. [...] The second slot controls a [...]
floating point units or a [...]. An instruction decoding
unit, or loader, [...] parallel execution and emulation
of complex instructions. [...] each [...] whether the
instructions in the two slots for that entry are
independent, so that they can be executed in parallel,
or dependent, so that they must be executed
sequentially. Using a single bit for this purpose allows
two dependent instructions to be stored in the slots of
a single cache entry. Otherwise, the two instructions
would have to be stored in separate entries and only
one half of the cache memory would be utilized in
those two entries.
[...] set forth an illustrative embodiment in which the
principles of the invention are utilized.

Figure 1 [...] incorporates the concepts of the
present invention.

Figure 2 is a block diagram illustrating the
structure of a partially decoded instruction cache
utilized in the Fig. 1 architecture.

Figure 3 [...] illustrating the [...]

Figure 4 [...] structure of the integer pipelines
utilized in the microprocessor architecture shown
in Fig. 1.

Fig. 1 shows a block diagram of a microprocessor
10 [...] pipelined functional units that are capable
of executing two instructions in parallel.

The instruction processor 12 includes three
modules: an instruction loader 18, an instruction
emulator 20 and an instruction cache 22. These
modules load instructions from the external system
through the bus interface unit [...] pairs of
instructions to the execution processor 14 for
execution.

The execution processor 14 includes two 4-stage
pipelined integer execution units 24 and 26,
a double-precision 5-stage pipelined floating point
execution unit 28, and a 1024 byte data cache 30.
A set of integer registers 32 services the two
integer units [...] 34 serves the floating point
execution unit 28. The bus interface unit 38 controls
the bus accesses requested by both the instruction
processor 12 and the execution processor 14. In the
illustrated embodiment the system modules [...]
As described in greater detail below, the
instruction loader 18 partially decodes instructions
retrieved from main memory and places the partially
decoded instructions in the instruction cache
22. That is, the instruction loader 18 translates an
instruction stored in main memory (not shown) into
the decoded format of the instruction cache 22. As
will also be described in greater detail below, the
instruction loader 18 is also responsible for checking
whether any dependencies exist between consecutive
instructions that are paired in a single
instruction cache entry.

The instruction cache 22 contains 512 entries
for partially-decoded instructions.

In accordance with one aspect of the present
invention, and as explained in greater detail below,
each entry in the instruction cache 22 contains
either one or two instructions stored in a partially
decoded format for efficient control of the various
functional units of the microprocessor 10.

In accordance with another aspect of the
present invention, each entry in instruction cache
22 also contains auxiliary information that indicates
whether the two instructions stored in that entry are
independent, so that they can be executed in parallel,
or dependent, so that they must be executed
sequentially.

The instruction emulator 20 executes special
instructions defined in the instruction set of the
microprocessor 10. When the instruction loader 18
encounters such an instruction, it transfers control
to the emulator 20. The emulator is responsible for
generating a sequence of core instructions (defined
below) that perform the function of a single complex
instruction (defined below). In this regard, the
emulator 20 provides ROM-resident microcode.
The emulator 20 also controls exception processing
and self-test operations.

The two 4-stage integer pipelines 24 and 26
perform basic arithmetic/logical operations and data
memory references. Each integer pipeline 24, 26
can execute instructions at a throughput of one per
system clock cycle.

The floating point execution unit 28 includes
three sub-units that perform single-precision and
double-precision operations. An FPU adder sub-unit
28a is responsible for add and convert operations,
a second sub-unit 28b is responsible for
multiply operations and a third sub-unit 28c is
responsible for divide operations.

When add and multiply operations are alternately
executed, the floating point execution unit 28
can execute instructions at a throughput of one
instruction per system clock cycle.

Memory references for the floating point
execution unit 28 are controlled by one of the integer
pipelines 24, 26 and can be performed in parallel to
floating-point operations.

Data memory references are performed using
the 1-Kbyte data cache 30. The data cache 30
provides fast on-chip access to frequently used
data. In the event that data are not located in the
data cache 30, then off-chip references are performed
by the bus interface unit (BIU) 38 using the
pipelined system bus 48.

The data cache 30 employs a load scheduling
technique so that it does not necessarily stall on
misses. This means that the two execution pipelines
24, 26 can continue processing instructions
and initiating additional memory references while
data is being read from main memory.

The bus interface unit 38 can receive requests
for main memory accesses from either the instruction
processor 12 or the execution processor 14.
These requests are sent to the external pipelined
bus 48. The external bus can be programmed to
operate at half the frequency of the microprocessor
10; this allows for a simple instruction interface at a
relatively low frequency while the microprocessor
10 executes a pair of instructions at full rate.

The instruction set of the microprocessor 10 is
partitioned into a core part and a non-core part.
The core part of the instruction set consists of
performance critical instructions and addressing
modes, together with some special-function instructions
for essential system operations. The non-core
part consists of the remainder of the instruction set.
Performance critical instructions and addressing
modes were selected based on an analysis and
evaluation of the operating system (UNIX in this
case) workload and various engineering, scientific
and embedded controller applications. These
instructions are executed directly as part of the
RISC architecture of microprocessor 10.

As stated above, special-function and non-core
instructions are emulated in microprocessor 10 by
macroinstruction subroutines using sequences of
core instructions. That is, instructions that are a
part of the overall instruction set of the microprocessor
10 architecture, but that lie outside the
directly-implemented RISC core, are executed under
control of the instruction emulator 20. When the
instruction loader 18 encounters a non-core instruction,
it either translates it into a pair of core instructions
(for simple instructions like MOVB 1(R0), 0(R1))
or transfers control to the instruction emulator
20. The instruction emulator 20 is responsible for
generating a sequence of core instructions that
perform the function of the single, complex
instruction.

Fig. 2 shows the structure of the instruction
cache 22. The instruction cache 22 utilizes a 2-way,
set-associative organization with 512 entries
for partially decoded instructions. This means that
for each memory address there are two entries in
the instruction cache 22 where the instruction located
at that address can be placed. The two
entries are called a "set".

As shown in Fig. 3, each instruction cache
entry includes two slots, i.e. Slot A and Slot B.
Thus, each entry can contain one or two partially-decoded
instructions that are represented with
fixed fields for opcode (Opc), source and destination
register numbers (R1 and R2, respectively),
and immediate values (32-bit IMM). The entry also
includes auxiliary information used to control the
sequence of instruction execution, including a bit P
that indicates whether the entry contains two consecutive
instructions that can be executed in parallel,
a bit Q that indicates whether the entry is
for a complex instruction that is emulated, and
additional information representing the length of the
instruction(s) in a form that allows fast calculation
of the next instruction's address.
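
As a rough illustration of the entry layout just described, the following Python sketch models one cache entry. The class and field names, and the choice to store the length as a plain byte count, are assumptions made for the example; the text only says that the length is kept in a form that allows fast calculation of the next address.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Slot:
    """One partially decoded instruction (hypothetical field names)."""
    opcode: str
    r1: Optional[int] = None   # source register number
    r2: Optional[int] = None   # destination register number
    imm: Optional[int] = None  # immediate value


@dataclass
class CacheEntry:
    """One instruction cache entry with two slots plus auxiliary bits."""
    slot_a: Optional[Slot]
    slot_b: Optional[Slot]
    p: bool = False   # bit P: the two instructions may execute in parallel
    q: bool = False   # bit Q: the entry emulates a complex instruction
    length: int = 0   # assumed: combined length in bytes of the instruction(s)


def next_address(pc: int, entry: CacheEntry) -> int:
    """Address of the instruction that follows this entry (assumed encoding)."""
    return pc + entry.length


entry = CacheEntry(Slot("LOAD", r1=0, r2=1, imm=4),
                   Slot("ADDD", r2=0, imm=4), p=True, length=6)
```
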

Referring back to Fig. 2, associated with each
entry in the instruction cache 22 is a 26-bit tag,
TAG0 and TAG1, respectively, that holds the 22
most-significant bits, 3 least-significant bits and a
User/Supervisor bit of the virtual address of the
instruction stored in the entry. In the event that two
consecutive instructions are paired in an entry, the
tag corresponds to the instruction at the lower
address. Associated with the tag are 2 bits that
indicate whether the entry is valid and whether it is
locked. For each set there is an additional single
bit that indicates the entry within the set that is
next to be replaced in a Least-Recently-Used
(LRU) order.

The instruction cache 22 is enabled for an
instruction fetch if a corresponding bit of the
configuration register of microprocessor 10, which is
used to enable or disable various operating modes
of the microprocessor 10, is 1 and either address
translation is disabled or the CI bit is 0 in the level-2
Page Table Entry (PTE) used to translate the
virtual address of the instruction.
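
The enabling condition above reduces to a short boolean expression. The sketch below is only illustrative; the function and parameter names are hypothetical stand-ins for the configuration register bit, the address translation state and the Page Table Entry bit named in the text.

```python
def icache_enabled(cfg_enable_bit: int,
                   translation_enabled: bool,
                   pte_inhibit_bit: int) -> bool:
    """True if the instruction cache is enabled for this fetch.

    cfg_enable_bit      -- configuration register bit (1 = enabled)
    translation_enabled -- whether address translation is active
    pte_inhibit_bit     -- level-2 PTE bit that must be 0 when translating
    """
    return cfg_enable_bit == 1 and (not translation_enabled
                                    or pte_inhibit_bit == 0)
```
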

If the instruction cache 22 is disabled, then the
instruction fetch bypasses the instruction cache 22
and the contents of the instruction cache 22 are
unaffected. The instruction is read directly from
main memory, partially decoded by the instruction
loader 18 to form an entry (which may contain two
partially decoded instructions), and transferred to
the integer pipelines 24, 26 via the IL BYPASS line
for execution.

As shown in Fig. 2, if the instruction cache 22
is enabled for an instruction fetch, then eight bits,
i.e. bits PC(10:3), of the instruction's address provided
by the program counter (PC) are decoded to
select the set of entries where the instruction may
be stored. The selected set of two entries is read
and the associated tags are compared with the 22
most-significant bits, i.e. PC(31:10), and 3 least-significant
bits, PC(2:0), of the instruction's virtual
address. If one of the tags matches and the matching
entry is valid, then the entry is selected for
transfer to the integer pipelines 24, 26 for execution.
Otherwise, the missing instruction is read directly
from main memory and partially decoded, as
explained below.
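
The lookup can be sketched in a few lines of Python. With 512 entries and two entries per set there are 256 sets, which matches the eight index bits PC(10:3). The sketch assumes the tag is built from PC(31:10), the three low-order address bits and the User/Supervisor bit, as in the tag description above; the class and function names are hypothetical.

```python
NUM_SETS = 256   # 512 entries / 2 entries per set
WAYS = 2


def set_index(pc: int) -> int:
    """Bits PC(10:3) select one of the 256 sets."""
    return (pc >> 3) & 0xFF


def make_tag(pc: int, user_mode: bool) -> tuple:
    """Assumed tag: PC(31:10), the three low-order bits, and the U/S bit."""
    return ((pc >> 10) & 0x3FFFFF, pc & 0x7, user_mode)


class InstructionCache:
    def __init__(self):
        # Each way is None or a (valid, tag, entry) tuple.
        self.sets = [[None] * WAYS for _ in range(NUM_SETS)]

    def fill(self, pc, entry, way, user_mode=False):
        """Store an entry in the given way of the set selected by pc."""
        self.sets[set_index(pc)][way] = (True, make_tag(pc, user_mode), entry)

    def lookup(self, pc, user_mode=False):
        """Return the matching valid entry, or None on a miss."""
        tag = make_tag(pc, user_mode)
        for way in self.sets[set_index(pc)]:
            if way is not None and way[0] and way[1] == tag:
                return way[2]
        return None
```

A miss (a `None` result) corresponds to the case in which the instruction is read from main memory and partially decoded.
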

If the referenced instruction is missing from the
instruction cache 22 and the contents of the selected
set are all locked, then the handling of the
reference is identical to that described above for
the case when the instruction cache 22 is disabled.



If the referenced instruction is missing from the
instruction cache 22 and at least one of the entries
in the selected set is not locked, then the following
actions are taken. One of the entries is selected for
replacement according to the least recently used
(LRU) replacement algorithm and then the LRU
pointer is updated. If the entry selected for replacement
is locked, then the handling of the reference
is identical to that described above for the case
when the instruction cache 22 is disabled. Otherwise,
the missing instruction is read directly from
external memory and then partially decoded by
instruction loader 18 to form an entry (that may
contain two partially decoded instructions) which is
transferred to the integer pipelines 24, 26 for execution.
If CIIN is not active during the bus cycles to
read the missing instruction, then the partially decoded
instruction is also written into the instruction
cache entry selected for replacement, the associated
valid bit is set and the entry is locked if
Lock-Instruction-Cache bit CFG.LIC in the configuration
register is 1.
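
The miss handling described in the last two paragraphs can be summarized as a small decision procedure. The Python sketch below is a simplified model under stated assumptions: each set has two ways, the single LRU bit is modeled as the index of the way to replace next, and updating the pointer is modeled as flipping that index. All names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Way:
    valid: bool = False
    locked: bool = False
    tag: object = None
    entry: object = None


@dataclass
class CacheSet:
    ways: list = field(default_factory=lambda: [Way(), Way()])
    lru: int = 0  # index of the way that is next to be replaced


def handle_miss(cset, tag, decoded_entry, cache_inhibit_active, lock_bit):
    """Return True if the decoded entry was written into the cache.

    In every case the decoded entry is still passed on for execution;
    a False result means the cache contents were left unchanged.
    """
    if all(w.locked for w in cset.ways):
        return False               # handled as if the cache were disabled
    victim = cset.ways[cset.lru]   # select by LRU ...
    cset.lru ^= 1                  # ... then update the LRU pointer (assumed flip)
    if victim.locked:
        return False               # selected entry is locked: bypass
    if cache_inhibit_active:
        return False               # executed but not cached
    victim.valid = True
    victim.tag = tag
    victim.entry = decoded_entry
    victim.locked = (lock_bit == 1)
    return True
```
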

After the microprocessor 10 has completed
fetching a missing instruction from external main
memory, it will continue prefetching sequential
instructions. For subsequent sequential instruction
fetches, the microprocessor 10 searches the instruction
cache 22 to determine whether the instruction
is located on-chip. If the search is successful
or a non-sequential instruction fetch occurs,
then the microprocessor 10 ceases prefetching.
Otherwise, the prefetched instructions are rapidly
available for decoding and executing. The microprocessor
10 initiates prefetches only during bus
cycles that would otherwise be idle because no
off-chip data references are required.

It is possible to fetch an instruction and lock it
into the instruction cache 22 without having to
execute the instruction. This can be accomplished
by enabling a Debug Trap (DBG) for a Program
Counter value that matches the instruction's address.
Debug Trap is a service routine that performs
actions appropriate to this type of exception.
At the conclusion of the DBG routine, the Return to
Execution (RETX) instruction is executed to resume
executing instructions at the point where the exception
was recognized. The instruction will be fetched
and placed into the instruction cache 22 before the
trap is processed.

When the instruction which is locked in the
instruction cache 22 gets to execution and a Debug
Trap on that instruction is enabled, instead of executing
the instruction, the processor will jump to
the Debug Trap service routine. The service routine
may set a breakpoint for the next instruction so that
when the processor returns from the service routine,
it will not execute the next instruction but
rather will go again to the Debug Trap routine.
The process described above, which usually
gets executed during system bootstrap, allows the
user to store routines in the instruction cache 22,
lock them and have them ready for operation without
executing them during the locking process.

Further information relating to the architecture
of microprocessor 10 and its cache locking capabilities
is provided in commonly-assigned application
Serial No.            , filed on the same date as
this application and titled SELECTIVELY LOCKING
MEMORY LOCATIONS WITHIN A MICROPROCESSOR'S
ON-CHIP CACHE; the just-referenced
application Serial No.            is hereby incorporated
by reference to provide further background
information regarding the present invention.

The contents of the instruction cache 22 can 
be invalidated by software or by hardware. 

The instruction cache 22 is invalidated by software
as follows: The entire instruction cache contents,
including locked entries, are invalidated while
bit CFG.IC of the Configuration Register is 0. The
LRU replacement information is also initialized to 0
while bit CFG.IC is 0. The Cache Invalidate (CINV)
instruction can be executed to invalidate the
instruction cache contents. Executing CINV invalidates
either the entire cache or only unlocked lines,
according to the instruction's L-option.

The entire instruction cache 22 is invalidated in 
hardware by activating an INVIC input signal. 

Fig. 3 shows a simplified view of a partially
decoded entry stored in the instruction cache 22.
As shown in Fig. 3, each entry has two slots for
instructions. Slot A controls integer pipeline 24 and
the port to data cache 30. Slot B controls the
second integer pipe 26, or one of the floating point
units or a control transfer instruction. Slot B can
also control the port to data cache 30, but only if
Slot A is not using the data cache 30. As stated
above, instruction loader 18 retrieves encoded
instructions from their compact format in main
memory and places them into Slots A and B according
to their functions.

Thus, in accordance with the present invention,
the novel aspects of instruction cache 22 include
(1) partially decoding instructions for storage in
cache memory, (2) placing of instructions into two
cache slots according to their function and (3)
placing auxiliary information in the cache entries
along with the instructions to control parallel execution
and emulation of complex instructions.

As further shown in Fig. 3, a bit P in each
instruction cache entry indicates whether the
instructions in Slots A and B are independent, so
they can be executed in parallel, or dependent, so
they must be executed sequentially.

An example of independent instructions that
can be executed in parallel is:
Load 4(R0), R1 ; Addd 4, R0



An example of dependent instructions requiring
sequential execution is:
Addd R0, R1 ; Addd R1, R2

Using a single bit for this purpose allows two
dependent instructions to be stored in the slots of a
single cache entry; otherwise, the two instructions
would have to be stored in separate entries and
only 1/2 of the instruction cache 22 would be
utilized in those two entries.
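
The two examples above can be checked with a simple register-overlap test. The text does not spell out the exact dependency rule, so the rule below (Slot B must not read or overwrite a register that Slot A writes) is an assumption chosen to agree with both examples; the read and write sets are filled in by hand.

```python
from typing import FrozenSet, NamedTuple


class Instr(NamedTuple):
    text: str
    reads: FrozenSet[str]
    writes: FrozenSet[str]


def p_bit(slot_a: Instr, slot_b: Instr) -> bool:
    """True if the pair may run in parallel under the assumed rule."""
    return not (slot_a.writes & (slot_b.reads | slot_b.writes))


load = Instr("Load 4(R0), R1", frozenset({"R0"}), frozenset({"R1"}))
add_r0 = Instr("Addd 4, R0", frozenset({"R0"}), frozenset({"R0"}))
add_a = Instr("Addd R0, R1", frozenset({"R0", "R1"}), frozenset({"R1"}))
add_b = Instr("Addd R1, R2", frozenset({"R1", "R2"}), frozenset({"R2"}))
```

Under this rule the first pair is marked parallel because the Load writes only R1, which the second instruction never touches; the second pair is marked sequential because R1 is written by the first Addd and read by the second.
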

Fig. 3 also shows a bit Q in each instruction
cache entry that indicates whether the instructions
in Slots A and B are emulating a single, more
complex instruction from main memory. For example,
the loader translates the single instruction
ADDD 0(R0), R1 into the following pair of instructions
in Slots A and B and sets the sequential and
emulation flags in the entry:
Load 0(R0), Temp
ADDD Temp, R1
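
A minimal sketch of this translation follows. The function name and the dictionary layout are hypothetical; "sequential" is modeled as bit P cleared and "emulation" as bit Q set, following the meaning of the two bits given earlier.

```python
def translate_memory_addd(disp: int, base: str, dest: str) -> dict:
    """Split ADDD disp(base), dest into a Load plus a register ADDD."""
    return {
        "slot_a": f"Load {disp}({base}), Temp",
        "slot_b": f"ADDD Temp, {dest}",
        "P": False,  # the ADDD depends on Temp, so the pair runs sequentially
        "Q": True,   # the pair emulates one more complex instruction
    }


entry = translate_memory_addd(0, "R0", "R1")
```
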

In accordance with the pipelined organization
of the microprocessor 10, every instruction executed
by the microprocessor 10 goes through a
series of stages. The two integer pipelines 24, 26
(Fig. 1) are able to work in parallel on instruction
pairs. Integer unit 24 and integer unit 26 are not
identical, the instructions that can be executed in
integer unit 24 being a subset of those that can be
executed in integer unit 26.

As stated above, instruction fetching is performed
by the instruction loader 18, which stores
decoded instructions in the instruction cache 22.
The integer dual-pipe receives decoded instruction
pairs for execution.

Referring again to Fig. 3, as stated above, an
instruction pair consists of two slots: Slot A and
Slot B. The instruction in Slot A is scheduled for
integer unit 24; the instruction in Slot B is scheduled
for integer unit 26. Two instructions belonging
to the same pair advance at the same time from
one stage of the integer pipeline to the next, except
in the case when the instruction in Slot B is delayed
in the instruction decode stage of the pipeline,
as described below. In this case, the instruction
in integer pipeline 24 can advance to the
following pipeline stages. However, new instructions
cannot enter the pipeline until the instruction
decode stage is free in both pipeline unit 24 and
pipeline unit 26.

Although the unit 24 and unit 26 instructions
are executed in parallel (except in the case of a
delayed Slot B instruction), the Slot A instruction always
logically precedes the corresponding Slot B instruction
and, if the Slot A instruction cannot be
completed due to an exception, then the corresponding
Slot B instruction is discarded.

Referring to Fig. 4, each of the integer pipeline
units 24, 26 includes four stages: an instruction
decode stage (ID), an execute stage (EX), a memory
access stage (ME) and a store result stage
(ST).

An instruction is fed into the ID stage of the
integer unit for which it is scheduled, where its
decoding is completed and register source
operands are read. In the EX stage, the
arithmetic/logical unit of the microprocessor 10 is
activated to compute the instruction's results or to
compute the effective memory address for
Load/Store instructions. In the ME stage, the data
cache 30 (Fig. 1) is accessed by Load/Store
instructions and exception conditions are checked.
In the ST stage, results are written to the register
file, or to the data cache 30 in the case of a Store
instruction, and Program Status Register (PSR)
flags are updated. At this stage, the instruction can
no longer be undone.

As further shown in Fig. 4, results from the EX
stage and the ME stage can be fed back to the ID
stage, thus enabling instruction latency of 1 or 2
cycles.

In the absence of any delays, the dual execution
pipeline of microprocessor 10 accepts a new
instruction pair every clock cycle (i.e., peak
throughput of two instructions per cycle) and
scrolls all other instructions down one stage along
the pipeline. The dual pipeline includes a global
stalling mechanism by which any functional unit
can stall the pipeline if it detects a hazard. Each
stall holds the corresponding stage and all stages
preceding it for one more cycle. When a stage stalls,
it keeps the instruction currently residing in it for
another cycle and then restarts all stage activities
exactly as in the non-stalled case.
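
The scrolling and stalling behavior can be modeled with a small simulation of the four stages named above. This is a sketch only: the function name is hypothetical, and inserting an empty bubble behind the stalled stage while later stages keep draining is an assumption, since the text only states that the stalled stage and all preceding stages are held.

```python
STAGES = ["ID", "EX", "ME", "ST"]


def step(pipe, incoming, stall_stage=None):
    """Advance the pipeline by one clock cycle.

    pipe        -- dict mapping stage name to an instruction pair or None
    incoming    -- pair waiting to enter the ID stage
    stall_stage -- stage that detected a hazard, or None
    Returns (new_pipe, accepted); accepted tells whether incoming entered ID.
    """
    new = dict(pipe)
    if stall_stage is None:
        new["ST"], new["ME"], new["EX"] = pipe["ME"], pipe["EX"], pipe["ID"]
        new["ID"] = incoming
        return new, True
    k = STAGES.index(stall_stage)
    # Stages after the stalled one keep moving; a bubble follows the stall.
    for i in range(len(STAGES) - 1, k, -1):
        new[STAGES[i]] = pipe[STAGES[i - 1]] if i - 1 > k else None
    # The stalled stage and all stages before it keep their contents.
    return new, False


pipe = {"ID": "p3", "EX": "p2", "ME": "p1", "ST": "p0"}
```
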

The pipeline unit on which each instruction is
to be executed is determined at run time by the
instruction loader 18 when instructions are fetched
from main memory.

The instruction loader 18 decodes prefetched
instructions, tries to pack them into instruction pair
entries and presents them to the dual-pipeline. If
the instruction cache 22 is enabled (as discussed
above), cacheable instructions can be stored in the
instruction cache 22. In this case, an entry containing
an instruction pair or a single instruction is also
sent to the instruction cache 22 and stored there as
a single cache entry. On instruction cache hits,
stored instruction pairs are retrieved from the instruction
cache 22 and presented to the dual-pipeline
for execution.

The instruction loader 18 attempts to pack
instructions into pairs whenever possible. The
packing of two instructions into one entry is possible
only if the first instruction can be executed by
integer pipeline unit 24 and both instructions are
less than a preselected maximum length. If it is
impossible to pack two instructions into a pair, then
a single instruction is placed in Slot B.



Two instructions can be paired only when all of
the following conditions hold: (1) both instructions
are performance-critical core instructions, (2) the
first instruction is executable by integer pipeline
unit 24, and (3) the displacement and immediate
fields in both instructions use short encoding (short
encoding for all instructions except the Branch instruction
is 11 bits, and 17 bits for the Conditional
Branch and Branch and Link instructions).

Several instructions of the microprocessor 10
instruction set are restricted to run on integer pipeline
unit 26 only. For example, because instruction
pairs in the instruction cache 22 are tagged by the
Slot A address, it is not useful to put a Branch
instruction in Slot A since the corresponding Slot B
instruction will not be accessible. Similarly, since
there is a single arithmetic floating point pipe, it is
not possible to execute two arithmetic floating point
instructions in parallel. Restricting these instructions
to integer pipeline unit 26 makes it possible
to considerably simplify the dual-pipe data path
design without hurting performance.

Integer unit 26 can execute any instruction in
the microprocessor 10 instruction set.

The instruction loader 18 initiates instruction
pairing upon an instruction cache miss, in which
case it begins prefetching instructions into an instruction
queue. In parallel, the instruction loader
18 examines the next instruction not yet removed
from the instruction queue and attempts to pack it
according to the following algorithm:

Step 1: Try to fit the next instruction into Slot A.

(a) if the next instruction is not performance
critical, then go to Step 5.

(b) remove the next instruction from the instruction
queue and tentatively place it in Slot A.

(c) if the instruction is illegal for Slot A, or if the
instruction has an immediate/displacement field
that cannot be represented in 11 bits, or if the
instruction is not quad-word aligned, then go to
Step 4.

(d) otherwise, continue to Step 2.

Step 2: Try to fit the next instruction into Slot B.

(a) if the next instruction is not performance-critical,
or the next instruction has an encoded
immediate/displacement field longer than 11
bits, or the next instruction is a branch with
displacement longer than 17 bits, then go to
Step 4.

(b) otherwise, remove the next instruction from
the instruction queue, place it in Slot B and go
to Step 3.

Step 3: Construct an instruction pair entry.

In this case, both Slot A and Slot B contain
valid instructions and all pairing conditions are satisfied.
Issue a pair entry and go to Step 1.
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t Co'Wnjcl a single instruction entry. 

In this case, Slot A contains an instruction
which cannot be paired. Move this instruction to
Slot B. If this instruction contains an
immediate/displacement longer than 17 bits,
or it is a branch with displacement longer than 17
bits, and is not quad-word aligned, then replace it
with UNDEFINED. Issue the entry and go to Step 1.

Step 5: Handle non-performance-critical
instructions.

Remove the next instruction from the instruction
queue and send it to the instruction emulator
20. When finished with this instruction, go to Step
1.

The just-described pairing algorithm packs two
instructions whenever they can be held in a single
instruction cache entry. However, these instructions
may happen to be dependent, in which case they
cannot be executed in parallel. The dependencies
are detected by the execution processor 14.
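
The five packing steps can be illustrated with a minimal Python sketch. It is not the loader's actual logic, only one reading of the steps as written. The names `Instr`, `pack`, `imm_bits` and `branch_disp_bits` are hypothetical and do not appear in the specification. The sketch assumes that field widths are given directly in bits, that an empty queue counts as "cannot pair", and that the Step 4 condition groups as "(long immediate/displacement or long branch displacement) and not quad-word aligned".

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    # Hypothetical model of one queued instruction; all fields are assumptions.
    name: str
    performance_critical: bool = True
    legal_in_slot_a: bool = True
    imm_bits: int = 0            # width of the immediate/displacement field
    branch_disp_bits: int = 0    # width of a branch displacement, 0 if not a branch
    quad_word_aligned: bool = True


UNDEFINED = Instr("UNDEFINED")


def pack(queue):
    """Return (entries, emulated); each entry is a (slot_a, slot_b) tuple."""
    entries, emulated = [], []
    while queue:
        # Step 1(a) and Step 5: non-critical instructions go to the emulator.
        if not queue[0].performance_critical:
            emulated.append(queue.popleft())
            continue
        # Step 1(b): tentatively place the instruction in Slot A.
        a = queue.popleft()
        # Step 1(c): conditions under which Slot A cannot hold it.
        fits_a = a.legal_in_slot_a and a.imm_bits <= 11 and a.quad_word_aligned
        # Step 2(a): inspect the next instruction without removing it.
        # Assumption: an empty queue is treated as "cannot pair".
        nxt = queue[0] if queue else None
        fits_b = (nxt is not None and nxt.performance_critical
                  and nxt.imm_bits <= 11 and nxt.branch_disp_bits <= 17)
        if fits_a and fits_b:
            # Steps 2(b) and 3: issue a pair entry.
            entries.append((a, queue.popleft()))
            continue
        # Step 4: move the instruction to Slot B and issue a single entry.
        if (a.imm_bits > 17 or a.branch_disp_bits > 17) and not a.quad_word_aligned:
            a = UNDEFINED
        entries.append((None, a))
    return entries, emulated
```

Calling `pack(deque([...]))` on a list of `Instr` objects yields the cache entries in issue order and, separately, the instructions handed to the emulator. Whether the two instructions of a pair entry are independent is not decided here; as stated above, that is left to the execution processor 14.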

It should be understood that various alternatives
to the embodiment of the invention described
herein may be utilized in practicing the invention. It
is intended that the following claims define the
scope of the invention and that methods and
apparatus within the scope of these claims and their
equivalents be covered thereby.

Claims

1. A processor that executes instructions retrieved
from a main memory external to the
processor or from an internal instruction cache
memory, the processor comprising:

(a) means for retrieving an encoded instruction
from the main memory;

(b) means for decoding the encoded instruction
retrieved from main memory;

(c) internal cache memory storage means
for storing the decoded instruction; and

(d) means for retrieving the decoded instruction
from the internal cache memory
storage means for execution by the processor.

2. A microprocessor that executes instructions retrieved
from a main memory external to the
microprocessor or from an internal instruction
cache memory, the microprocessor comprising:

(a) a plurality of functional units for executing
instructions;

(b) means for retrieving encoded instructions
from main memory;

(c) means for decoding the encoded
instructions retrieved from main memory;

(d) internal cache memory storage means
comprising a plurality of storage locations,
each storage location comprising a plurality
of storage slots, each of the storage slots
comprising means for storing a decoded
instruction; and

(e) means for simultaneously retrieving a
plurality of decoded instructions from the
storage slots of a selected cache memory
storage location for parallel execution by the
plurality of functional units.

3. A microprocessor as in claim 2 wherein each
of the cache memory storage locations includes
means for storing auxiliary information
indicative of whether the plurality of instructions
stored in the slots of a cache memory storage
location are independent such that the
instructions may be executed in parallel, or
dependent such that the instructions must be
executed sequentially.

4. A method of executing instructions by a processor
that retrieves instructions from a main
memory external to the processor or from an
internal instruction cache memory, the method
comprising:

(a) retrieving an encoded instruction from
the main memory;

(b) decoding the encoded instruction retrieved
from main memory;

(c) storing the decoded instruction in an
internal cache memory; and

(d) retrieving the decoded instruction from
the internal cache memory for execution by
the processor.

5. A method of executing instructions by a microprocessor
that retrieves instructions from a
main memory external to the microprocessor
or from an internal instruction cache memory,
the microprocessor including a plurality of
functional units for executing instructions, the
method comprising:

(a) retrieving encoded instructions from
main memory;

(b) decoding the instructions retrieved from
main memory;

(c) storing the decoded instructions in an
internal cache memory storage means comprising
a plurality of storage locations, each
storage location comprising a plurality of
storage slots, each of the storage slots
comprising means for storing a decoded
instruction; and

(d) simultaneously retrieving a plurality of
decoded instructions from the storage slots
of a selected cache memory storage location
for parallel execution by the plurality of
functional units.

6. A method as in claim 5 and including the step
of storing auxiliary information in the cache
memory storage locations, the auxiliary information
being indicative of whether the plurality
of instructions stored in the slots of a cache
memory storage location are independent such
that the instructions may be executed in parallel,
or dependent such that the instructions
must be executed sequentially.

Europäisches Patentamt
European Patent Office
Office européen des brevets




Publication number: 0 459 232 A3



EUROPEAN PATENT APPLICATION

Application number: 9110789M
Int. Cl.: G06F 9/38

Date of filing: HOAJI



Priority: 2SL0UO US S29889

Date of publication of application:
04.12.91 Bulletin 91/49

Designated Contracting States:
DE FR GB IT

Date of deferred publication of the search report:
15.04^ Bulletin 92^8



Applicant: NATIONAL SEMICONDUCTOR
CORPORATION
2900 Semiconductor Drive
Santa Clara, CA 95051-4090 (US)



Inventor: Alpert, Donald B.
52, Hanadiv Street
Herzlia (IL)

Inventor: Avnon, Dror
3 Hahazavim Street, Ramat Poleg
Netanya (IL)

Inventor: Ben-Meir, Amos
24/1 Daniel Moritz Street
Ramat Aviv (IL)

Inventor: Talmudi, Ran
42 Hakohev Street
Raanana (IL)



Representative: Sparing Röhl Henseler
Patentanwälte European Patent Attorneys
Rethelstrasse 123
W-4000 Düsseldorf 1 (DE)



Partially decoded instruction cache.

A microprocessor partially decodes instructions
retrieved from main memory before placing them
into the microprocessor's integrated instruction
cache. Each storage location in the instruction cache
includes two slots for decoded instructions. One slot
controls one of the microprocessor's integer pipelines
and a port to the microprocessor's data cache.
A second slot controls the second integer pipeline or
one of the microprocessor's floating point units. The
instructions retrieved from main memory are decoded
by a loader unit which decodes the instructions
from the compact form as stored in main
memory and places them into the two slots of the
instruction cache entry according to their functions.
In addition, auxiliary information is placed in the
cache entry along with the instruction to control
parallel execution as well as emulation of complex
instructions. A bit in each instruction cache entry
indicates whether the instructions in the two slots are
independent, so that they can be executed in parallel,
or dependent, so that they must be executed
sequentially. Using a single bit for this purpose allows
two dependent instructions to be stored in the
slots of the single cache entry.